
    Exploiting and coping with sparsity to accelerate DNNs on CPUs

    Deep Neural Networks (DNNs) have become ubiquitous, achieving state-of-the-art results across a wide range of tasks. While GPUs and domain-specific accelerators are emerging, general-purpose CPUs hold a firm position in the DNN market due to their high flexibility, high availability, high memory capacity, and low latency. Various working sets in DNN workloads can be sparse, i.e., contain zeros, and the level of sparsity varies with its source. First, when the level is low enough, traditional sparse algorithms are not competitive against dense algorithms. In such cases, the common practice is to apply dense algorithms to uncompressed sparse inputs. However, this implies that a fraction of the computations are ineffectual because they operate on zero-valued inputs. Second, when the level is high, one may apply traditional sparse algorithms to compressed sparse inputs. Although such an approach does not induce ineffectual computations, the indirection in a compressed format often causes irregular memory accesses, hampering performance. This thesis studies how to improve DNN training and inference performance on CPUs both by discovering work-skipping opportunities in the first case and by coping with the irregularity in the second case.
    To tackle the first case, this thesis proposes both a pure software approach and a software-transparent hardware approach. The software approach, called SparseTrain, leverages the moderately sparse activations in Convolutional Neural Networks (CNNs) to speed up their training and inference. Such sparsity changes dynamically and is unstructured, i.e., it has no discernible patterns. SparseTrain detects the zeros inside a dense representation and dynamically skips the useless computations at run time. The hardware approach, called the Sparsity Aware Vector Engine (SAVE), exploits the unstructured sparsity in both the activations and the weights. Similar to SparseTrain, SAVE dynamically detects zeros in a dense representation and then skips the ineffectual work. SAVE augments a CPU's vector processing pipeline: it assembles denser vector operands by combining effectual vector lanes from multiple vector instructions that contain ineffectual lanes. SAVE is general purpose and accelerates any vector workload that has zeros in its inputs; nonetheless, it contains optimizations targeting matrix-multiplication-based DNN models. Both SparseTrain and SAVE accelerate DNN training and inference on CPUs significantly.
    For the second case, this thesis focuses on a type of DNN that is severely impacted by the irregularity caused by sparsity: Graph Neural Networks (GNNs). GNNs take graphs as input, and graphs often have highly sparse connectivity. This thesis proposes software optimizations that (i) overlap the irregular memory accesses with the compute, (ii) compress and decompress the features dynamically, and (iii) improve the temporal reuse of the features. The optimized implementation significantly outperforms a state-of-the-art GNN implementation. In addition, this thesis discusses, as future work, the idea of offloading a GNN's irregular memory access phase to an augmented Direct Memory Access (DMA) engine.
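    To make the work-skipping idea concrete, below is a minimal NumPy sketch of run-time zero skipping on an uncompressed, dense operand, in the spirit of SparseTrain and SAVE. The shapes, the row-wise update formulation, and the correctness check are illustrative assumptions, not the thesis's actual vectorized kernels or hardware design.

```python
import numpy as np

def matmul_skip_zeros(acts, weights):
    """Compute acts @ weights while skipping every multiply-accumulate whose
    activation operand is zero. The zero test happens at run time on the
    dense, uncompressed array; no compressed sparse format is built."""
    M, K = acts.shape
    K2, N = weights.shape
    assert K == K2
    out = np.zeros((M, N), dtype=acts.dtype)
    for m in range(M):
        for k in range(K):
            a = acts[m, k]
            if a == 0.0:                    # dynamic zero check: skip ineffectual work
                continue
            out[m, :] += a * weights[k, :]  # one vectorized row update
    return out

# Activations after ReLU are often moderately sparse, as in the CNNs above.
acts = np.maximum(np.random.randn(64, 128).astype(np.float32), 0.0)
weights = np.random.randn(128, 256).astype(np.float32)
assert np.allclose(matmul_skip_zeros(acts, weights), acts @ weights, atol=1e-3)
```

    In the thesis the same kind of check happens at a much finer granularity: SparseTrain performs it in software inside dense CNN kernels, while SAVE performs it in hardware by combining effectual lanes from multiple in-flight vector instructions.

    For the GNN case, the sketch below shows a baseline neighbor aggregation, whose gathers are irregular because of sparse connectivity, and one generic way to improve the temporal reuse of feature rows by grouping edges by source node. The edge-list representation and the grouping strategy are illustrative assumptions, not the thesis's optimized implementation.

```python
import numpy as np

def aggregate_baseline(edges, feats):
    """out[dst] += feats[src] for every edge. With a sparse, unsorted edge
    list the reads of feats[src] are irregular and cache-unfriendly."""
    out = np.zeros_like(feats)
    for src, dst in edges:
        out[dst] += feats[src]
    return out

def aggregate_grouped_by_source(edges, feats):
    """Same result, but edges are grouped by source node so each feature
    row is loaded once and reused for all of that node's outgoing edges."""
    out = np.zeros_like(feats)
    edges = sorted(edges)                  # group edges sharing a source
    i = 0
    while i < len(edges):
        src = edges[i][0]
        row = feats[src]                   # load the feature row once
        while i < len(edges) and edges[i][0] == src:
            out[edges[i][1]] += row        # reuse it across outgoing edges
            i += 1
    return out

rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 64)).astype(np.float32)
edges = [(int(rng.integers(1000)), int(rng.integers(1000))) for _ in range(5000)]
assert np.allclose(aggregate_baseline(edges, feats),
                   aggregate_grouped_by_source(edges, feats), atol=1e-4)
```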

    Towards an Achievable Performance for the Loop Nests

    Numerous code optimization techniques, including loop nest optimizations, have been developed over the last four decades. Loop optimization techniques transform loop nests to improve the performance of the code on a target architecture, including by exposing parallelism. Finding and evaluating an optimal, semantics-preserving sequence of transformations is a complex problem. The search is guided by heuristics and/or analytical models, and there is no way of knowing how close the result gets to optimal performance or whether there is any headroom for improvement. This paper makes two contributions. First, it uses a comparative analysis of loop optimizations/transformations across multiple compilers to determine how much headroom may exist for each compiler. Second, it presents an approach that characterizes loop nests by their hardware performance counter values, together with a Machine Learning approach that predicts which compiler will generate the fastest code for a given loop nest. The prediction is made both for auto-vectorized serial compilation and for auto-parallelization. The results show that the headroom for state-of-the-art compilers ranges from 1.10x to 1.42x for the serial code and from 1.30x to 1.71x for the auto-parallelized code; these results are based on the Machine Learning predictions.
    Comment: Accepted at the 31st International Workshop on Languages and Compilers for Parallel Computing (LCPC 2018).
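    As a hedged illustration of the prediction step, the sketch below trains a classifier that maps a loop nest's hardware performance counter values to the compiler expected to produce the fastest code. The counter features, the compiler label set, the random-forest model, and the synthetic data are placeholder assumptions, not the paper's exact features, learner, or benchmarks.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# One row per loop nest: a feature vector of normalized counter values
# (e.g. instructions retired, cache miss rates, branch mispredictions, ...).
rng = np.random.default_rng(0)
X = rng.random((200, 12))                      # placeholder counter features
compilers = np.array(["gcc", "clang", "icc"])  # placeholder label set
y = compilers[rng.integers(len(compilers), size=200)]  # fastest compiler per nest

model = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(model, X, y, cv=5).mean())  # estimate of predictive accuracy

# At deployment time: measure the counters of a new loop nest once (e.g. from a
# baseline binary), then compile it with the predicted-fastest compiler.
model.fit(X, y)
new_nest = rng.random((1, 12))
print(model.predict(new_nest)[0])
```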